We report on aggressive quantization strategies that greatly accelerate inference of Recurrent Neural Network Transducers (RNN-T). We use a 4-bit integer representation for both weights and activations and apply Quantization-Aware Training (QAT) to retrain the full model (acoustic encoder and language model), achieving near-iso-accuracy. We show that customized quantization schemes tailored to the local properties of the network are essential for attaining good accuracy while limiting the computational overhead of QAT. Density-ratio language model fusion has shown accuracy improvements on RNN-T workloads but significantly increases the computational cost of inference. We show that our quantization strategies enable hypothesis search with large beam widths while achieving streaming-compatible runtimes, together with a 7.6$\times$ compression of the full model compared to the full-precision baseline. Via hardware simulation, we estimate a 3.4$\times$ acceleration from FP16 to INT4 for the end-to-end quantized RNN-T, inclusive of LM fusion, resulting in a Real-Time Factor (RTF) of 0.06. On the NIST Hub5 2000, Hub5 2001, and RT-03 test sets, we retain most of the gains associated with LM fusion, improving the average WER by more than 1.5%.
translated by Google Translate
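The 4-bit scheme described above can be illustrated with a minimal "fake quantization" sketch, as commonly used inside QAT training loops: the tensor is rounded to int4 levels and immediately dequantized, so downstream computation sees the quantization error. This is a generic sketch, not the paper's actual implementation; the per-channel option and the clipping range are illustrative assumptions.

```python
import numpy as np

def fake_quantize_int4(x: np.ndarray, per_channel_axis=None) -> np.ndarray:
    """Symmetric 4-bit fake quantization: round to int4 levels, then
    dequantize, so the tensor stays float but takes at most 16 distinct
    values per scale. In QAT, rounding is bypassed in the backward pass
    (straight-through estimator)."""
    qmax = 7  # symmetric int4 range is [-8, 7]
    if per_channel_axis is None:
        scale = np.max(np.abs(x)) / qmax
    else:
        reduce_axes = tuple(a for a in range(x.ndim) if a != per_channel_axis)
        scale = np.max(np.abs(x), axis=reduce_axes, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale
```

Tailoring the scale granularity (per-tensor vs. per-channel) is one example of the "customized quantization schemes" the abstract refers to.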
Code-switching (CS) is a common linguistic phenomenon in multilingual communities that consists of switching between languages while speaking. This paper presents our investigations into end-to-end speech recognition for Mandarin-English CS speech. We analyze different CS-specific issues, such as the mismatch of properties between the languages in a CS language pair, the unpredictable nature of switching points, and data scarcity. We exploit and improve a state-of-the-art end-to-end system by merging non-linguistic symbols, integrating language identification via a hierarchical softmax, modeling subword units, artificially lowering the speaking rate, and augmenting the data with speed-perturbation techniques and several monolingual datasets; this improves the final performance not only on CS speech but also on monolingual benchmarks, making the system more applicable to real-life settings. Finally, we explore the effect of different language-model integration methods on the performance of the proposed model. Our experimental results show that all the proposed techniques improve recognition performance. The best combined system improves on the baseline by up to 35% in terms of mixed error rate and delivers acceptable performance on monolingual benchmarks.
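The speed-perturbation augmentation mentioned above can be sketched as simple waveform resampling. This is a minimal illustration assuming the common 0.9/1.0/1.1 factor recipe; production toolkits typically use proper polyphase resampling rather than linear interpolation.

```python
import numpy as np

def speed_perturb(wave: np.ndarray, factor: float) -> np.ndarray:
    """Resample a waveform by `factor` via linear interpolation.
    factor > 1.0 speeds the audio up (fewer samples); factor < 1.0
    slows it down."""
    n_out = int(round(len(wave) / factor))
    # positions in the original signal to sample from
    src = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(src, np.arange(len(wave)), wave)

# Typical augmentation: keep the original plus slowed/sped copies.
def augment(wave: np.ndarray, factors=(0.9, 1.0, 1.1)):
    return [speed_perturb(wave, f) for f in factors]
```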
This paper presents our latest effort on improving code-switching language models that suffer from data scarcity. We investigate methods to augment code-switching training text data by generating it artificially. Specifically, we propose a framework based on cycle-consistent adversarial networks that transfers monolingual text into code-switching text, treating code-switching as a speaking style. Our experimental results on the SEAME corpus show that utilizing artificially generated code-switching text data consistently improves the language model as well as the automatic speech recognition performance.
This paper presents our latest investigation into improving automatic speech recognition of noisy speech via speech enhancement. We propose a novel method named multi-discriminator CycleGAN that reduces the noise of the input speech and thereby improves automatic speech recognition performance. Our proposed method leverages the CycleGAN framework for speech enhancement without requiring any parallel data and improves on it by introducing multiple discriminators, each examining a different frequency region. Furthermore, we show that training multiple generators on homogeneous subsets of the training data works better than training one generator on all of it. We evaluate our method on the CHiME-3 dataset and observe relative improvements of up to 10.03% on the development set and up to 14.09% on the evaluation set.
Previous studies have confirmed the effectiveness of leveraging articulatory information to attain improved speech enhancement (SE) performance. By augmenting the original acoustic features with place/manner-of-articulation features, the SE process can be guided to account for the articulatory properties of the input speech when performing enhancement. Accordingly, we believe that the contextual information of articulatory attributes carries useful information and can be further exploited across different languages. In this study, we propose an SE system that improves its performance by optimizing the contextual articulatory information in the enhanced speech, for both English and Mandarin. We do so by jointly training the SE model with an end-to-end automatic speech recognition (E2E ASR) model that predicts sequences of broad phonetic classes (BPCs) instead of word sequences. Meanwhile, two training strategies are developed to train the SE system with the BPC-based ASR: a multi-task learning strategy and a deep-feature training strategy. Experimental results on the TIMIT and TMHINT datasets confirm that the contextual articulatory information enables the SE system to achieve better results than a conventional acoustic model (AM). Moreover, compared with another SE system trained with a monophone-based ASR, the BPC-based ASR (which provides the contextual articulatory information) improves SE performance more effectively at different signal-to-noise ratios (SNRs).
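The multi-task strategy described above can be sketched as a weighted sum of an SE reconstruction loss and the BPC-ASR classification loss. This is a minimal sketch: the L1 spectral loss and the weighting factor `alpha` are illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

def multitask_se_loss(enhanced: np.ndarray, clean: np.ndarray,
                      bpc_asr_loss: float, alpha: float = 0.5) -> float:
    """Multi-task objective: L = (1 - alpha) * L_SE + alpha * L_BPC.
    L_SE is an L1 reconstruction loss between the enhanced and clean
    spectra; L_BPC is the (precomputed) loss of the BPC-based ASR
    branch run on the enhanced speech. alpha is a hypothetical weight."""
    l_se = float(np.mean(np.abs(enhanced - clean)))
    return (1 - alpha) * l_se + alpha * bpc_asr_loss
```

Because the ASR branch only needs to distinguish broad phonetic classes rather than words, its gradient signal pushes the SE front-end toward preserving coarse articulatory structure rather than lexical detail.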
This paper presents a new perspective on analog design space search. To minimize time-to-market, this effort is better cast as a constraint-satisfaction problem rather than the global optimization formulated in prior art. We incorporate model-based agents, in contrast to model-free learning, to implement a trust-region strategy. As a result, simple feed-forward networks can be trained via supervised learning, with comparatively straightforward convergence. Experimental results demonstrate an orders-of-magnitude reduction in the number of search iterations. In addition, an unprecedented consideration of PVT conditions is accommodated. On circuits in TSMC 5/6nm processes, our method matches the performance of human designers. Furthermore, the framework is in production in an industrial setting.
Deep learning models can achieve high accuracy when trained on large amounts of labeled data. However, real-world scenarios often involve several challenges: Training data may become available in installments, may originate from multiple different domains, and may not contain labels for training. Certain settings, for instance medical applications, often involve further restrictions that prohibit retention of previously seen data due to privacy regulations. In this work, to address such challenges, we study unsupervised segmentation in continual learning scenarios that involve domain shift. To that end, we introduce GarDA (Generative Appearance Replay for continual Domain Adaptation), a generative-replay based approach that can adapt a segmentation model sequentially to new domains with unlabeled data. In contrast to single-step unsupervised domain adaptation (UDA), continual adaptation to a sequence of domains enables leveraging and consolidation of information from multiple domains. Unlike previous approaches in incremental UDA, our method does not require access to previously seen data, making it applicable in many practical scenarios. We evaluate GarDA on two datasets with different organs and modalities, where it substantially outperforms existing techniques.
The development of social media user stance detection and bot detection methods relies heavily on large-scale and high-quality benchmarks. However, in addition to low annotation quality, existing benchmarks generally have incomplete user relationships, hindering graph-based account detection research. To address these issues, we propose a Multi-Relational Graph-Based Twitter Account Detection Benchmark (MGTAB), the first standardized graph-based benchmark for account detection. To our knowledge, MGTAB was built based on the largest original data in the field, with over 1.55 million users and 130 million tweets. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. In MGTAB, we extracted the 20 user property features with the greatest information gain and user tweet features as the user features. In addition, we performed a thorough evaluation of MGTAB and other public datasets. Our experiments found that graph-based approaches are generally more effective than feature-based approaches and perform better when introducing multiple relations. By analyzing experiment results, we identify effective approaches for account detection and provide potential future research directions in this field. Our benchmark and standardized evaluation procedures are freely available at: https://github.com/GraphDetec/MGTAB.
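The information-gain ranking used to pick the top 20 user property features can be sketched as follows for discrete features. This is a generic sketch of IG(Y; X) = H(Y) - H(Y | X), not the benchmark's actual feature pipeline.

```python
import numpy as np

def entropy(labels: np.ndarray) -> float:
    """Shannon entropy (bits) of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(feature: np.ndarray, labels: np.ndarray) -> float:
    """IG(Y; X) = H(Y) - H(Y | X) for one discrete feature column:
    how much knowing the feature value reduces label uncertainty."""
    cond = 0.0
    for v in np.unique(feature):
        mask = feature == v
        cond += mask.mean() * entropy(labels[mask])
    return entropy(labels) - cond
```

Ranking all candidate feature columns by `information_gain` and keeping the top 20 mirrors the selection described in the abstract.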
As one of the prevalent methods to achieve automation systems, Imitation Learning (IL) presents a promising performance in a wide range of domains. However, despite the considerable improvement in policy performance, the corresponding research on the explainability of IL models is still limited. Inspired by recent approaches in explainable artificial intelligence, we propose a model-agnostic explaining framework for IL models called R2RISE. R2RISE aims to explain the overall policy performance with respect to the frames in demonstrations. It iteratively retrains the black-box IL model from randomly masked demonstrations and uses the conventional evaluation outcome, environment returns, as the coefficient to build an importance map. We also conducted experiments to investigate three major questions concerning frames' importance equality, the effectiveness of the importance map, and connections between importance maps from different IL models. The results show that R2RISE successfully distinguishes important frames from the demonstrations.
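The return-weighted importance map described above follows the RISE recipe; a minimal sketch of the aggregation step is shown below. The normalization by inclusion counts is an assumption borrowed from the original RISE formulation, and the mask/return values are illustrative.

```python
import numpy as np

def rise_importance(masks: np.ndarray, returns: np.ndarray) -> np.ndarray:
    """RISE-style importance over demonstration frames.
    masks: (n_trials, n_frames) binary array; masks[t, i] == 1 means
           frame i was kept when retraining the IL policy in trial t.
    returns: (n_trials,) environment return of each retrained policy.
    Importance of frame i is the return-weighted count of its
    inclusions, normalized by how often it was kept."""
    weighted = returns @ masks                 # (n_frames,)
    kept = masks.sum(axis=0)                   # inclusion counts
    return weighted / np.maximum(kept, 1)      # guard unseen frames
```

Frames whose inclusion correlates with high returns receive high importance, which is how R2RISE separates informative frames from uninformative ones.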
Compressed videos often exhibit visually annoying artifacts, known as Perceivable Encoding Artifacts (PEAs), which dramatically degrade video visual quality. Subjective and objective measures capable of identifying and quantifying various types of PEAs are critical in improving visual quality. In this paper, we investigate the influence of four spatial PEAs (i.e. blurring, blocking, bleeding, and ringing) and two temporal PEAs (i.e. flickering and floating) on video quality. For spatial artifacts, we propose a visual saliency model with a low computational cost and higher consistency with human visual perception. In terms of temporal artifacts, self-attention based TimeSFormer is improved to detect temporal artifacts. Based on the six types of PEAs, a quality metric called Saliency-Aware Spatio-Temporal Artifacts Measurement (SSTAM) is proposed. Experimental results demonstrate that the proposed method outperforms state-of-the-art metrics. We believe that SSTAM will be beneficial for optimizing video coding techniques.